Ford GoBike System Data Exploration

Table of Contents¶

  • Introduction
  • Preliminary Wrangling
  • Univariate Exploration
  • Bivariate Exploration
  • Multivariate Exploration
  • Conclusion

Introduction¶

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area. It contains 16 features and well over 150,000 observations.

Import Dataset and Packages¶

The first step of any data analysis project is importing the packages and the dataset to be analyzed.

In [1]:
# Import packages 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px
In [2]:
# Import Dataset "201902-fordgobike-tripdata.csv" and load into DataFrame 
df = pd.read_csv("201902-fordgobike-tripdata.csv")

Preliminary Wrangling¶

Next, let us inspect the dataset to see if it needs any wrangling.

Check for missing values

In [3]:
df.isnull().sum()
Out[3]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64

Since rows with missing values make up under 5% of the data, let us drop them.

In [4]:
df.dropna(inplace=True)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             174952 non-null  int64  
 1   start_time               174952 non-null  object 
 2   end_time                 174952 non-null  object 
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object 
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object 
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64  
 12  user_type                174952 non-null  object 
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object 
 15  bike_share_for_all_trip  174952 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.7+ MB

Some columns have the wrong datatypes; let us convert them to more appropriate ones.

In [6]:
# Change dtype from 'float' to 'int'
cols = ["start_station_id","end_station_id","member_birth_year"]
for col in cols:
    df[col] = df[col].astype(int)
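Note that this conversion only works because we dropped the missing rows first; plain `astype(int)` raises on NaN. If we ever wanted to keep those rows instead, pandas' nullable `Int64` dtype is one alternative — a minimal sketch on a toy series, not the real station-id column:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a station-id column that still contains gaps;
# astype(int) would raise ValueError here, but the nullable
# "Int64" dtype keeps the missing entry as <NA>
ids = pd.Series([30.0, np.nan, 55.0])
nullable_ids = ids.astype("Int64")
print(nullable_ids)
```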
In [7]:
# Check
df[cols].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   start_station_id   174952 non-null  int32
 1   end_station_id     174952 non-null  int32
 2   member_birth_year  174952 non-null  int32
dtypes: int32(3)
memory usage: 3.3 MB

Notice that the columns "start_time" and "end_time" contain timestamps but are stored as generic objects.

Let us convert them to datetime dtype.

In [8]:
cols = ["start_time","end_time"]
for col in cols:
    df[col] = pd.to_datetime(df[col])
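As an aside, the same conversion can be done at load time with `read_csv`'s `parse_dates` parameter — a sketch on an inline toy CSV rather than the real file:

```python
import io
import pandas as pd

# Toy CSV with the same column names as the real dataset
csv_text = (
    "start_time,end_time,duration_sec\n"
    "2019-02-01 08:00:00,2019-02-01 08:10:00,600\n"
    "2019-02-02 17:30:00,2019-02-02 17:45:00,900\n"
)
# parse_dates converts the listed columns to datetime64 while reading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_time", "end_time"])
print(toy.dtypes)
```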
In [9]:
# Check
df[cols].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   start_time  174952 non-null  datetime64[ns]
 1   end_time    174952 non-null  datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 4.0 MB
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  int32         
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  int32         
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type                174952 non-null  object        
 13  member_birth_year        174952 non-null  int32         
 14  member_gender            174952 non-null  object        
 15  bike_share_for_all_trip  174952 non-null  object        
dtypes: datetime64[ns](2), float64(4), int32(3), int64(2), object(5)
memory usage: 20.7+ MB

Univariate Exploration¶

Let us explore the distribution of some of the variables in our dataset

user_type¶

Let us see the distribution of the observations in the "user_type" variable

In [11]:
fig,ax = plt.subplots(figsize=(10,6))
color = sns.color_palette()[0]
sns.countplot(data = df, x = "user_type",color= color)
plt.xlabel("User Type")
plt.ylabel("Frequency [Count]")
plt.title("Distribution of User Types");
print(df.user_type.value_counts(normalize=True))
Subscriber    0.905311
Customer      0.094689
Name: user_type, dtype: float64

We can see that the distribution of user types in our dataset is massively unbalanced with over 90% of users being Subscribers.

duration_sec¶

Next, let us look at the distribution of the "duration_sec" variable using a histogram

In [12]:
# plot 'duration_sec' histogram with 200-second bins
bin_range = np.arange(0,df.duration_sec.max()+200,200)
plt.hist(data= df,x="duration_sec",bins=bin_range)
plt.xlabel("Time:[sec]")
plt.ylabel("Frequency [count]")
plt.title("Distribution of Time [Sec]");

We can see from the visualization that the duration_sec variable is highly right-skewed, but let us go further and trim the values outside the 10th and 90th percentiles to see if we can get a better view of the distribution.
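An alternative to trimming, sketched below on synthetic lognormal data rather than the real column, is to keep every observation and use log-spaced bins with a logarithmic x-axis:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

# Synthetic right-skewed durations standing in for 'duration_sec'
rng = np.random.default_rng(42)
durations = rng.lognormal(mean=6.3, sigma=0.8, size=10_000)

# Log-spaced bins spread the skewed mass evenly across the axis
bins = np.logspace(np.log10(durations.min()), np.log10(durations.max()), 50)
plt.hist(durations, bins=bins)
plt.xscale("log")
plt.xlabel("Time [sec] (log scale)")
plt.ylabel("Frequency [count]")
plt.title("Log-scaled histogram of toy durations");
```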

In [13]:
# Get the 10th and 90th percentiles
low,high = df.duration_sec.quantile([0.1,0.9])
# Extract mask of values between "low" and "high" and plot
mask = df.duration_sec.between(low,high)
df_mask = df[mask]
sns.displot(df_mask["duration_sec"])
plt.xlabel("Time:[sec]")
plt.ylabel("Frequency [count]")
plt.title("Distribution of Trimmed Time [Sec]");

Now we can clearly see the right-skewed distribution of our duration_sec variable.
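Fixed percentiles are one trimming rule; Tukey's 1.5×IQR fence is another common one. A minimal sketch on a toy series, not the real column:

```python
import pandas as pd

# Toy durations with one extreme value
s = pd.Series([60, 120, 300, 400, 500, 650, 800, 50_000])
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
# Keep only values inside the Tukey fences
fenced = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(fenced.max())
```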

member_gender¶

Next, let us look at the proportion (distribution) of the genders in our DataFrame

In [14]:
fig,ax = plt.subplots(figsize=(10,6))
df.member_gender.value_counts().plot.pie()
plt.xlabel("User Gender")
plt.ylabel("Frequency [Count]")
plt.title("Distribution of User Gender");
print(df.member_gender.value_counts(normalize=True))
Male      0.745919
Female    0.233235
Other     0.020846
Name: member_gender, dtype: float64

We can see that nearly 75% of our users are Male, and only about 2% of our users identify as neither Male nor Female.

member_birth_year¶

Let us explore the distribution of member_birth_year

In [15]:
fig,ax = plt.subplots(figsize=(20,6))
color = sns.color_palette()[0]
sns.countplot(data = df, x="member_birth_year",color = color)
plt.xticks(rotation=90)
plt.title("Distribution of Birth Years");

We can see that the bulk of the members in our DataFrame were born between 1982 and 1996.

bike_share_for_all_trip¶

In [16]:
fig,ax = plt.subplots(figsize=(10,6))
sns.countplot(data = df, x= "bike_share_for_all_trip",color = sns.color_palette()[0])
plt.ylabel("Frequency [count]")
plt.xlabel("Bike shared for all trip?")
plt.title("Distribution of bike_share_for_all_trip")
print(df.bike_share_for_all_trip.value_counts(normalize=True))
No     0.900853
Yes    0.099147
Name: bike_share_for_all_trip, dtype: float64

We can see that only about 10 percent of the trips in our dataset were taken under the Bike Share for All program.

start_station_id¶

Even though start_station_id is numerical, it encodes categorical data, so let us use a countplot to check its distribution.

In [17]:
# Plot a countplot of "start_station_id"
fig,ax = plt.subplots(figsize=(20,6))
sns.countplot(data= df,x="start_station_id",color = sns.color_palette()[0])
plt.xticks([])
plt.title("Distribution of Station IDs");
print(f"Number of Unique Station IDs : {df.start_station_id.nunique()}")
Number of Unique Station IDs : 329

Our plot is difficult to interpret because it has so many categories (329 in total).

Bivariate Exploration¶

Now let us look at the relationship between Variable(s) in our DataFrame.

  1. What proportion of each gender in our DataFrame are Customers and Subscribers?

We can start by visualizing the relationship between variables member_gender and user_type

In [18]:
# plot side-by-side countplot of 'member_gender' and 'user_type'
fig,ax= plt.subplots(figsize=(10,6))
order =df.member_gender.value_counts().index
fig =sns.countplot(data=df,x="member_gender", hue="user_type",order=order)

While we can surmise that a higher proportion of Males are Subscribers than the other two gender categories, it would be better to look at the actual numbers.

In [19]:
df.user_type.groupby([df.member_gender]).value_counts(normalize=True).to_frame()
Out[19]:
                          user_type
member_gender user_type            
Female        Subscriber   0.886705
              Customer     0.113295
Male          Subscriber   0.911931
              Customer     0.088069
Other         Subscriber   0.876611
              Customer     0.123389

We can see that indeed about 91% of Male users are Subscribers, compared with about 89% for Female and 88% for Other.
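The same per-gender proportions can also be computed in one call with `pd.crosstab` and `normalize='index'` — a sketch on a four-row toy frame rather than the real data:

```python
import pandas as pd

# Toy rides; normalize='index' makes each gender's row sum to 1
toy = pd.DataFrame({
    "member_gender": ["Male", "Male", "Female", "Female"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Subscriber"],
})
props = pd.crosstab(toy["member_gender"], toy["user_type"], normalize="index")
print(props)
```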

  2. Next, we want to find out which day(s) of the week have the highest average time in seconds that bike riders in our DataFrame spend on the road.

For this exploration, we will do some feature engineering to extract the feature day_of_week from the variable start_time

In [20]:
# create feature "day_of_week"
df["day_of_week"] = df["start_time"].dt.dayofweek

Let us change the encoding of day_of_week to readable day names

In [21]:
# Create dw_mapping dict
dw_mapping={
    0: 'Monday', 
    1: 'Tuesday', 
    2: 'Wednesday', 
    3: 'Thursday', 
    4: 'Friday',
    5: 'Saturday', 
    6: 'Sunday'
} 
In [22]:
# Change 'day_of_week' encoding
df['day_of_week']=df['day_of_week'].map(dw_mapping)
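The extraction and mapping steps can also be collapsed into one call with `Series.dt.day_name()` — a sketch on toy timestamps rather than the real column:

```python
import pandas as pd

# Toy timestamps standing in for 'start_time'
times = pd.Series(pd.to_datetime(["2019-02-04", "2019-02-09", "2019-02-10"]))
# day_name() returns the weekday label directly, no mapping dict needed
print(times.dt.day_name().tolist())
```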

Now, let us conduct a bivariate exploration of the variables day_of_week and duration_sec to determine on which day of the week bikers spent the most time [sec].

In [23]:
# Aggregate 'duration_sec' by mean per 'day_of_week'
day_average = df.duration_sec.groupby(df.day_of_week).mean()
day_average
Out[23]:
day_of_week
Friday       681.274699
Monday       696.970906
Saturday     849.771750
Sunday       858.673649
Thursday     678.289748
Tuesday      652.839459
Wednesday    665.127946
Name: duration_sec, dtype: float64

Since day of the week is ordinal data, let us rearrange the index accordingly.

In [24]:
day_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
day_average = day_average.reindex(day_order)
day_average
Out[24]:
day_of_week
Monday       696.970906
Tuesday      652.839459
Wednesday    665.127946
Thursday     678.289748
Friday       681.274699
Saturday     849.771750
Sunday       858.673649
Name: duration_sec, dtype: float64
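An alternative to reindexing by hand, sketched here on a toy series, is an ordered `CategoricalIndex`, which makes `sort_index` respect weekday order instead of alphabetical order:

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
# Toy day-level means, deliberately out of weekday order
means = pd.Series({"Sunday": 858.7, "Monday": 697.0, "Saturday": 849.8})
means.index = pd.CategoricalIndex(means.index, categories=day_order, ordered=True)
means = means.sort_index()
print(means)
```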
In [25]:
fig,ax = plt.subplots(figsize=(10,6))
color = ["lightblue","lightblue","lightblue","lightblue","lightblue","lightblue","orange"]
plt.bar(day_average.index, day_average.values,color=color)
plt.xlabel("Day of the week")
plt.ylabel("Average Duration [Sec]")
plt.title("Average Duration [sec] by day");

We can see that bike riders in our dataset spent the most time riding, on average, on Sundays.

  3. Let us see the relationship between duration_sec and member_gender

Let us use a boxplot to visualize the relationship between the gender categories in our dataset and the time they spent riding

In [26]:
# Create boxplot of 'member_gender' vs 'duration_sec'
sns.boxplot(x= "member_gender",y="duration_sec",data=df,order = ["Male","Female","Other"],
           color = sns.color_palette()[0]);

We can clearly see the outliers we noticed during the univariate exploration of duration_sec

Let us create another boxplot using df_mask, the DataFrame we created by trimming the duration_sec variable

In [27]:
sns.boxplot(x= "member_gender",y="duration_sec",data=df_mask,order = ["Male","Female","Other"],
           color = sns.color_palette()[0]);

We can see that the median Female rider spent more time riding than her Male and Other counterparts

Finally,

Let us check the correlation between variables in our Dataset

In [28]:
plt.subplots(figsize=(10,6))
corr =df.corr()
sns.heatmap(corr,cmap="Blues");

We can see the expected high correlations between the location variables 'start_station_latitude' and 'end_station_latitude', and between 'start_station_longitude' and 'end_station_longitude', as well as moderate-to-low correlations between various other variables in our dataset.
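One caveat: in pandas 2.0 and later, `DataFrame.corr` no longer silently drops non-numeric columns, so on newer versions the call above needs `numeric_only=True`. A sketch on a toy mixed-type frame:

```python
import pandas as pd

# Toy frame mixing numeric and string columns
toy = pd.DataFrame({
    "duration_sec": [600, 900, 300],
    "start_station_latitude": [37.80, 37.79, 37.81],
    "user_type": ["Subscriber", "Customer", "Subscriber"],
})
# numeric_only=True restricts the correlation to numeric columns
corr = toy.corr(numeric_only=True)
print(corr)
```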

Multivariate Exploration¶

Using Plotly, Let us visualize the location of our bike stations in a map of San Francisco.

In [29]:
# Create a scatter_mapbox of bike stations 
fig = px.scatter_mapbox(
    df,
    lat = 'start_station_latitude',
    lon = 'start_station_longitude',
    width = 600,
    height = 600,
    center = {"lat":37.804562,
             "lon":-122.24878},
    title = "Map of San Francisco Bay showing the Various Bike Stations",
    hover_data = ["start_station_name"]
    
)
fig.update_layout(mapbox_style = "carto-positron")
fig.show(renderer='notebook')

We can see the locations of the various bike stations in our Dataset on the map of San Francisco

Next, let us check the distribution of duration_sec for each category of user_type in our dataset, for each day_of_week

In [30]:
# Create function to plot boxplot of 'duration_sec' by 'user_type'
# on facets of 'day_of_week'
def plot_facetgrid(df, cat_col1, cat_col2, num_col):
    """Takes in four parameters and returns a seaborn FacetGrid plot
    
        Parameters:
            df: DataFrame containing the data
            cat_col1 (str) : name of categorical column in df to facet on
            cat_col2 (str) : name of categorical column in df to group by 
            num_col (str) : name of numerical column in df to plot with
        Returns:
            g : FacetGrid plot
    """
    g = sns.FacetGrid(data = df, col = cat_col1, height = 4,col_wrap = 3, col_order = day_order)
    g.map(sns.boxplot, cat_col2, num_col , order =['Subscriber', 'Customer'])
    return g
In [31]:
# Plot with 'plot_facetgrid'
plot_facetgrid(df,"day_of_week","user_type","duration_sec");

We see a lot of outliers in our boxplots due to the highly skewed nature of duration_sec, so let us do a little wrangling before we create the visualization again

In [32]:
# Get the 10th and 90th percentiles for 'duration_sec'
low,high = df.duration_sec.quantile([0.1,0.9])
# Extract mask of values between "low" and "high" and plot
mask = df.duration_sec.between(low,high)
df_mask = df[mask]

Let us use our plot_facetgrid function to make another boxplot but this time with df_mask

In [33]:
plot_facetgrid(df_mask,"day_of_week","user_type","duration_sec");

We can see that even though we have a lot more Subscribers than Customers, the Customers spent more time riding on every single day of the week.

Let us look at the numbers to support the intuition from the boxplot

In [34]:
df.duration_sec.groupby([df.user_type,df.day_of_week]).describe()
Out[34]:
                            count         mean          std   min     25%    50%      75%      max
user_type  day_of_week                                                                            
Customer   Friday          2541.0  1114.279024  2192.160785  63.0  481.00  761.0  1140.00  63600.0
           Monday          2309.0  1366.174968  3761.378108  70.0  499.00  778.0  1234.00  83407.0
           Saturday        2171.0  1642.627821  4294.763915  79.0  577.00  937.0  1585.50  83519.0
           Sunday          2299.0  1667.036973  3535.710756  62.0  575.50  949.0  1622.50  69335.0
           Thursday        2868.0  1181.823222  3232.611989  81.0  475.00  731.0  1096.00  82512.0
           Tuesday         2268.0  1034.151675  2543.888349  64.0  469.75  719.0  1071.25  79548.0
           Wednesday       2110.0  1228.843602  3917.437145  66.0  470.25  699.5  1055.75  71470.0
Subscriber Friday         25122.0   637.477868  1353.972845  61.0  319.00  494.0   750.00  74408.0
           Monday         23332.0   630.744600  1301.012802  61.0  308.00  483.0   737.00  72590.0
           Saturday       12243.0   709.177898  1678.976147  62.0  304.00  491.0   795.00  84548.0
           Sunday         12213.0   706.505691  1564.696244  61.0  294.00  485.0   793.00  73930.0
           Thursday       30844.0   631.469167  1169.289247  61.0  316.00  495.0   755.00  63867.0
           Tuesday        28316.0   622.297853  1228.616082  61.0  314.00  489.0   739.00  70211.0
           Wednesday      26316.0   619.929587  1211.139902  61.0  313.00  489.0   737.00  83195.0
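Because duration_sec is so skewed, the group medians (the 50% column above) are more trustworthy than the means, and they can be computed directly with a groupby. A sketch on a five-row toy frame, not the real data:

```python
import pandas as pd

# Toy rides; one extreme Customer trip drags the mean far above the median
toy = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Customer", "Subscriber", "Subscriber"],
    "duration_sec": [500, 761, 63_600, 494, 750],
})
medians = toy.groupby("user_type")["duration_sec"].median()
means = toy.groupby("user_type")["duration_sec"].mean()
print(medians)
```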

Conclusion¶

  1. The features user_type, member_gender and bike_share_for_all_trip are highly imbalanced in our dataset.

  2. duration_sec is highly right-skewed.

  3. More bike riders were born in 1988 than in any other year in our dataset.

  4. The San Francisco Bay area has 329 bike stations.

  5. Bike riders in our dataset spent the most time riding, on average, on Sundays.

  6. On average, Female riders spend more time [sec] riding than Male riders.

  7. Customers spent more time riding than Subscribers on every single day of the week.